jwdg's Posts

Thanks Zaphod, that's really helpful - and some food for thought.

Plenty of FAST Cache: understood, that makes sense. I assume that your 15 drives were 7 RAID1 pairs (14 drives) plus a hot spare? In your sample system did you just have 2 back-end buses, or did you use 4 or 6, for example? I'm not sure what the trigger point is for increasing the back-end bus count (obviously the smaller models manage with just 2 anyway).

10K & 7.2K drives: I notice that you've gone for HDDs rather than eMLC SSDs. I assume that is just a cost/GB factor? Likewise the choice of 10K instead of 15K drives is presumably a similar argument.

2x VNX5200 instead of 1x VNX5600 - that's an interesting thought in terms of phasing, as we could get some additional life out of one of our existing VNX5500s if we went down that route, but that's a decision others in my project may have to make depending on how much capital expenditure they want to push out into the future. I understand the point about future expansion pricing, and that is subject to the same project direction - I can see the discount advantage of buying upfront.

Thanks again for your comment - it has been really useful in pointing out what I need to focus on.
Hi all

Firstly, apologies if this is not a suitable subject to raise here - I have reasons (described below) for trying to clarify things before approaching a sales team!

I'm in the process of trying to specify a storage system (which we will buy two instances of) to support a pair of basically identical virtual platforms (also likely to be near-identical in workload). My outline is all about *one* of these systems. In our case we already have a VNX5500 for each platform, which we bought to support the pilot system (10TB usable 15K, 64TB usable 7.2K), and "sticking with what we know" a VNX2 is a likely successor.

Perhaps in contrast to some use cases, our future workload for the platform is fairly well scoped. It will basically just be more of the same VM types as in the pilot. I have been able to dig through vSphere performance graphs to estimate the IOPS, bandwidth and capacity usage of different VM types (for many of these types the successor system will just have 8x the count of VMs compared to the current one). In the process of gathering this data, I've found some interesting things (today I discovered the Mitrend tool, which has been valuable in characterising our workload):

1. Our workload in the pilot is currently typically 70-80% writes at a 95th-percentile IOPS of 1,700. It appears that we have several DB servers that host what are actually quite small MS SQL databases which readily fit in server memory but which have quite high change rates (thus frequent transaction logging). I think the capacity of this very write-heavy data is probably only a few hundred GB. We also have some continuous file writing from several of our applications. The loading seems very consistent, though it appears that an AV component install across our server estate pushed the IOPS to around 25K for a few hours (80% utilisation on one SP, 32% on the other, so there was still capacity...).

2. The total final capacity will need to be between 180 and 350TB. Some of the continuously written material (estimated to be around 150TB) will be written once (at a rate of around 400GB/day, retained for slightly over 12 months) but may never be read again. This block of data may be removed partly or wholly from my scope, hence the range of capacity values. This 400GB/day will have very few reads.

3. The estimated bandwidth is likely to be around 500-800 MByte/s.

4. The estimated throughput is 3K IOPS read, 5K write (based on a calculation of the slightly changed workload mix of the final system).

So... I have read the best practices for performance guide and my assumptions/questions are as follows (there is a rough sizing sketch after this list):

a. Getting the capacity I need will take (say) 120 x 7.2K 4TB drives in RAID6.

b. I don't know whether I need any other spinning disks, or whether I should just have an eMLC flash tier (15 disks?). I did try the VNX Skew report on Mitrend, but unfortunately all it delivered was a config summary and no heatmaps (I'll pursue that as a separate enquiry).

c. I'm assuming I would have some FAST Cache drives (how many?).

d. I don't know whether it is worth the overhead of separating the workload into different pools (I suppose the obvious case would be the long-tail data being in an (almost?) entirely NL-SAS pool).

e. There may be an interest in staging some of the purchase - the capacity requirement will take some time to grow as the project turns on the workloads.
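As a sanity check on point 4 and assumption (a), here is the back-of-envelope arithmetic I've been using - plain Python, with the RAID group layout, hot-spare ratio, pool overhead and per-drive IOPS figures all being my own rule-of-thumb assumptions rather than anything from EMC:

# Back-of-envelope sizing sketch. All rules of thumb below are assumptions
# (per-drive IOPS, RAID6 6+2 groups, pool overhead) - not vendor figures.

RAW_DRIVES = 120          # proposed 7.2K NL-SAS drive count
DRIVE_TB = 4              # decimal TB per drive
RAID6_GROUP = (6, 2)      # 6 data + 2 parity per RAID group (assumption)
HOT_SPARES = 4            # roughly 1 spare per 30 drives (assumption)

data, parity = RAID6_GROUP
groups = (RAW_DRIVES - HOT_SPARES) // (data + parity)
usable_tb = groups * data * DRIVE_TB * 0.95   # ~5% pool metadata overhead (assumption)
print(f"~{usable_tb:.0f} TB usable from {RAW_DRIVES} x {DRIVE_TB}TB drives")

# Front-end workload estimate from point 4: ~3K read IOPS, ~5K write IOPS.
READ_IOPS, WRITE_IOPS = 3000, 5000
RAID6_WRITE_PENALTY = 6          # each host write becomes ~6 disk operations
backend_iops = READ_IOPS + WRITE_IOPS * RAID6_WRITE_PENALTY
PER_DRIVE_IOPS = 90              # rough figure for a 7.2K NL-SAS drive (assumption)
drives_for_iops = backend_iops / PER_DRIVE_IOPS
print(f"~{backend_iops} back-end IOPS -> ~{drives_for_iops:.0f} NL-SAS drives "
      "before any FAST Cache / flash-tier absorption")

Running the numbers that way is what makes me suspect the flash tier / FAST Cache questions (b) and (c) matter more than the raw NL-SAS spindle count, given how write-heavy the mix is.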
I realise that some of this is advice I could get through an EMC partner - however, it is likely (due to my organisation's policy) that we will need to procure this on a competitive basis from a pre-established supplier list (which contains several EMC partners). If we were to offer what I've written above to several partners, would it be sufficient to produce a useful BoM without great wasted effort for the unsuccessful suppliers? Is there a process for asking EMC to transform this kind of summary into a BoM? It may be that what I really need is a small bit of consultancy with someone more knowledgeable, though I'd have to venture into the bureaucracy of procuring that!

Finally - part of my rationale for posting here is the hope that others might find the design discussion interesting. Thanks to those who have spent their time reading this far... I shan't be offended if the length causes many of you to skip it altogether!

John
Thanks Glen

I would be keen to restore the ATS functionality (which I think is HardwareAcceleratedLocking - see ETA 207784: VNX: Storage Processors may restart if VMware vStorage APIs for Array Integration (VAAI) is enabled, resulting in potential data unavailability). I think our decision was based on that discussion, or an earlier version of the ETA before there was a hotfix available for our OE version. I'll try to tackle getting that hotfix and installing it so we can turn on ATS again!

I have replaced the disk, but as I can't reopen the "CallHome" SR (it keeps getting automatically closed because I've been sent a replacement disk!) I've now opened another SR (76519062) to chase the underlying problem. I think this saga might be a case study worthy of review by someone who has oversight of the EMC support process...

John
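P.S. For when we do get to re-enabling ATS, this is the sort of thing I have in mind for checking and flipping HardwareAcceleratedLocking on each host - a rough, untested Python wrapper around esxcli, assuming it is run somewhere esxcli is available; treat it as an outline rather than a procedure:

# Rough outline for checking/re-enabling ATS (VMFS3/HardwareAcceleratedLocking)
# on each ESXi host once the array-side hotfix from ETA 207784 is in place.
# Assumes "esxcli" is on the PATH (e.g. run on the host itself); untested sketch.

import subprocess

OPTION = "/VMFS3/HardwareAcceleratedLocking"

def get_ats_setting():
    """Return the raw 'esxcli system settings advanced list' output for ATS."""
    out = subprocess.run(
        ["esxcli", "system", "settings", "advanced", "list", "--option", OPTION],
        capture_output=True, text=True, check=True)
    return out.stdout

def enable_ats():
    """Set HardwareAcceleratedLocking back to 1 (enabled)."""
    subprocess.run(
        ["esxcli", "system", "settings", "advanced", "set",
         "--option", OPTION, "--int-value", "1"],
        check=True)

if __name__ == "__main__":
    print(get_ats_setting())
    # enable_ats()   # uncomment only after the array-side hotfix is confirmed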
Thanks Glen

On the subject of the SCSI reservations and VAAI/ATS - I think we currently have ATS disabled per ETA 207784 until we can schedule an upgrade to the latest OE version (we're in a change freeze for Christmas and New Year, and the OE upgrade is probably not enough of an emergency to bypass that).

Drive firmware - it's on the list to do early in the new year (probably even sooner than the OE upgrade).

iSCSI logouts: I can see why the transfer of data to the hot spare might load the system, but the ESXi logs suggest *very* high latency: "Long VMFS rsv time on 'V2 BULK 20' (held for 4417 msecs)" makes me wonder if the SP is stalling completely for several seconds?

I haven't replaced the drive yet, perhaps because I was hoping to trigger an escape from what has become a repetitive process. My concern is that I don't know how to get EMC to perform proper analysis on a superficially functional array with a repetitive fault, which is why I had wanted to leave the fault present. Perhaps I need to replace the drive (I think this will be the 6th replacement) and *then* raise a new SR to request some analysis of why the drives keep failing?
I haven't reseated anything yet - is that any more involved than just loosening the thumbscrews, pulling the LCC out and replacing it? How disruptive is that to normal array services, assuming the SP on the other side is operating normally? (This might influence when I do it, to avoid disruption to the VMs running off the storage.)
Thanks again for the pointers, Adham. My optimism has declined a bit. I've been told that the iSCSI logouts are a network congestion symptom (though they are very highly correlated with the bus errors) and that, as the drive wasn't replaced when it was reseated, I needed to replace it again. In the end, the original drive replacement (aside from the reseating event) was sufficiently far in the past that I was asked to replace the drive again, which we arranged and was done on the 10th.

The slot failed again at 9PM last night with the same pattern - a burst of Soft SCSI Bus Errors for the slot, shutdown errors for the *adjacent* drive and iSCSI timeouts for the ESXi hosts (thankfully 9PM on weekdays is quiet for our usage scenario).

For anyone at EMC who'd like to review, the latest fault (with today's SPCollects) is on SR 76183538. The previous case (at the start of this thread) was 75734958 and the attempt before that was 74134482.

Thanks

John
Thanks Adham. I have got an SR open and had a phone conversation this morning about it, which may be leading somewhere other than another replacement drive (perhaps an LCC reseat)!

In terms of your questions, though: the SCSI bus errors came as a burst of 84 Soft SCSI Bus Errors logged on SP B for 0/2/5 from 00:28:41 to 00:29:09 (17 were logged from 00:28:41 to 00:28:43 on SP A). The next significant event on SP B is the "drive handler" offline I quoted above at 01:25:24. I haven't spotted any media errors on the affected drive prior to that (whereas I have seen that for other drives prior to them failing).

Thanks again for the response

John
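P.S. For anyone wanting to reproduce that kind of count, something like the rough sketch below would do it. It assumes naviseccli is installed and authenticated (security file or -user/-password), that the SP management IPs shown are placeholders, and that the event-log lines contain the literal text "Soft SCSI Bus Error" (the exact format varies by OE release), so treat it as an outline rather than a tested tool:

# Count Soft SCSI Bus Error bursts per SP from the event log (rough sketch).

import subprocess
from collections import Counter

SPS = {"SPA": "10.0.0.1", "SPB": "10.0.0.2"}   # placeholder management IPs

def soft_scsi_errors(sp_ip):
    """Return event-log lines mentioning Soft SCSI Bus Errors for one SP."""
    out = subprocess.run(["naviseccli", "-h", sp_ip, "getlog"],
                         capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines()
            if "Soft SCSI Bus Error" in line]

for name, ip in SPS.items():
    lines = soft_scsi_errors(ip)
    # Group by minute; assumes the first two whitespace fields are date and time.
    by_minute = Counter(" ".join(l.split()[:2])[:16] for l in lines)
    print(name, "Soft SCSI Bus Error bursts:", by_minute.most_common(5))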
I wonder if anyone has seen a similar thing before. I have a VNX5500 to which I added an additional enclosure and drives back in June. On the 29th August, one of the drives in the new enclosure failed (prior to this, there were numerous Soft SCSI Bus Errors for the drive concerned):

Brief Description: CLARiiON call-home event number 0x712789a0
Host: SPA
Storage Array: CKMxxxxxx
SP: N/A
SoftwareRev: 7.33.2 (0.51)
BaseRev: 05.32.000.5.209
Description: Drive(Bus 0 Encl 2 Slot 5) taken offline. SN:Z1xxxxxx. TLA:005050144PWR. Reason:Drive Handler(0x00c3)

A replacement drive was shipped, but when inserted on 2nd September it wasn't recognised (it appeared to fail before the replacement wizard completed). Another replacement was shipped and failed about 6 hours after installation. A further replacement was shipped and lasted around 9 days, until the 18th September, when it failed too. On that occasion we noticed that several LUNs went intermittently unavailable for a couple of minutes (causing some user disruption) when the drive "failed".

Eventually, as part of some information gathering, a colleague inadvertently pulled out and reseated the "failed" drive - at which point (on the 29th October) it came back online, and it worked until a couple of days ago, when it went offline with similar LUN issues (though these were less disruptive as they happened in the middle of the night). I think another drive is on its way to us as I write!

Interestingly, when it failed, one of the SPs logged a set of Unit Shutdown errors for the *adjacent* drive... whether those are the cause of the LUN disruption, I don't know. I'm assuming this is not a common situation, but I've made little headway in getting it investigated, so I wonder if the wider community has any ideas?

Thanks

John
I am responsible for two VNX5500 unified systems. I am in the UK, which has an electrical safety testing regime which (coupled with my employer's policies) requires equipment to undergo electrical safety inspections and tests. Both VNX5500 systems were tested on initial installation, but an additional requirement has arisen for one of the systems to be retested. For those not familiar with the UK testing regime, this will require each 230V-powered device to be disconnected from its supply and connected to a test set - the test takes around 30 seconds and then the normal power is reconnected.

For the DAEs we are assuming that removing only one inlet at a time (probably doing Power A on all DAEs, then Power B) will not cause any disruption. Likewise for the DME. For the control station I could just use a standard Linux shutdown prior to the test (is this reasonable and without hazard?).

The more significant question is how the testing of the DPE PSUs and SPS should be performed. Both the SPS and DPE PSU will need to be separately tested - are the DPE PSUs cross-wired to support both SPs, or should I assume that when the PSU in SP A is disconnected, SP A goes off? In which case, is the following a sensible approach?

1. Shut down SP A (by turning off the SPS switch, or via Unisphere or CLI? What's the least disruptive way to do this with a multipathed VMware environment?)
2. Disconnect the SPS-SP A power cable.
3. Test the SP A PSU and SPS side A.
4. Reconnect the SPS-SP A power cable and turn on the SPS A power switch.
5. Wait for full operation of SP A, then repeat for SP B.

This system is not in production, so a full outage is possible if a partial shutdown is not, but it is useful for us to understand the logic. While this may seem a very basic question, I can't find any clear references to performing something like this in the available manuals. A reference would be welcome if there is a document for this.

Many thanks

John
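P.S. Whatever sequence we settle on, I would want to confirm the array is healthy before and after each power step. A rough sketch of that check is below - a plain Python wrapper around "naviseccli faults -list", assuming naviseccli and a security file are already set up; the SP addresses are placeholders and the "operating normally" text match is an assumption about the output, so treat it as an outline only:

# Sanity check to run before and after each PSU/SPS power test: ask each SP
# whether the array reports any faults. Placeholder IPs; untested outline.

import subprocess
import sys

SPS = {"SPA": "10.0.0.1", "SPB": "10.0.0.2"}   # placeholder management IPs

def array_healthy(sp_ip):
    """Return (healthy, detail) based on 'naviseccli faults -list' output."""
    out = subprocess.run(["naviseccli", "-h", sp_ip, "faults", "-list"],
                         capture_output=True, text=True)
    if out.returncode != 0:
        return False, out.stderr.strip() or "naviseccli failed"
    healthy = "operating normally" in out.stdout.lower()
    return healthy, out.stdout.strip()

if __name__ == "__main__":
    ok = True
    for name, ip in SPS.items():
        healthy, detail = array_healthy(ip)
        first_line = (detail.splitlines() or [""])[0]
        print(f"{name}: {'OK' if healthy else 'CHECK'} - {first_line}")
        ok = ok and healthy
    sys.exit(0 if ok else 1)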