Welcome to the EMC Support Community Ask the Expert conversation.
This is an opportunity to learn about and discuss proactive activities for moving toward five 9s availability on CLARiiON / VNX.
This discussion begins on Monday, February 18th and concludes on March 1st. Get ready by bookmarking this page or signing up for e-mail notifications.
Mark Cox is a Technical Account Manager (TAM) with over 38 years of experience in the IT industry, and has been with EMC for 21 of those 38 years. The TAM practice is a post-sales global community of technical expertise, logically grouped around a defined set of products, technologies, and services, aligned to proactively help customers promote best practices and avoid problems. Before joining the TAM team, Mark was an Advisory Field Support Specialist with primary responsibility for the VNX / CLARiiON product lines, which include both File and Block configurations.
This Ask the Expert discussion is now open. We look forward to your questions and interactive discussion.
To open the discussion, I will offer a snippet of information that speaks to availability.
Availability measures the percentage of time the system is able to return data when requested by
the client. Degraded performance is not included in the metric.
For example, five 9s availability means 99.999 percent availability. This percentage translates into
a total data-not-available time of about five minutes and fifteen seconds per year.
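The arithmetic behind that figure can be checked with a short sketch. It assumes a 365-day year and treats availability as a simple fraction of total time; the function names are illustrative and are not part of any EMC tool.

```python
def downtime_seconds_per_year(availability_percent, days_per_year=365):
    """Return the allowed data-not-available time per year, in seconds.

    Assumes a 365-day year by default; degraded performance is not
    counted, matching the definition of availability given above.
    """
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days_per_year * 24 * 60 * 60


def format_downtime(seconds):
    """Format a duration in seconds as hours, minutes and whole seconds."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for nines in (99.9, 99.99, 99.999):
        print(nines, format_downtime(downtime_seconds_per_year(nines)))
```

For 99.999 percent this gives roughly 315 seconds, which is the five minutes and fifteen seconds quoted above.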
Redundancy has many levels. Our first priority is the data and, after that, the operations that use the data. We protect the data with different RAID types based on priority and add the additional protection of the hot spare. Today this is common throughout the industry and available with virtually all storage arrays.
To protect the operations that use the data, we initially relied on host clusters with shared data, duplicate HBAs, duplicate switches and duplicate network paths. In current environments we see more virtualization, where the virtual host itself is reduced to a set of files that can also be protected at the storage level.
With so many advances in virtualization over the last couple of years, we can now move virtual hosts across the network to different server hardware, either to take advantage of additional processor power during peak operating times or in the event of a server failure. With this capability we have virtual machines clustered across different physical servers, creating redundancy at the environment level.
I took a class on implementing Exchange where we had a virtual machine clustered to another virtual machine at a disaster recovery site. We were using an RP splitter attached to the array to replicate the data across the network for the other side of the cluster. My thoughts are that this can be expensive, and we would need to prioritize the operations in our environment based on our RTO (recovery time objective).
Thank you for your question and your interest.
Thanks for the information. When designing a system to achieve HA, do you consider the performance factor? Even if we have redundant components at the hardware, server, connectivity and storage levels, if performance cannot meet the requirements of the application we would still have downtime, as with an OLTP application that requires strict response times. HA does nothing for us in such a bad-response-time case.
What do you think?
You are correct; however, for any specific piece of equipment we do not consider degraded performance in the metrics for five 9s, or 99.999% uptime. I agree that if the performance of the application is less than acceptable, then we must consider the data as still unavailable for our requirements.
We can assume that at some point we will experience a failure, so we must consider what our RPO (recovery point objective) is going to be. RPO is determined by the amount of time between data protection events, and reflects the amount of data that potentially could be lost during an outage. These intervals can be as small as seconds or as long as hours, depending on our priorities and the SLA requirements that we have set for a specific application and our business model. An example might be that our online sales will take priority over a backup application. While we are looking at the amount of data that must be available, we must also consider the RTO (recovery time objective): the time required to return the application and its data to a functioning state. This also includes the performance of the application, as you indicated.
I think it is important to note here that a larger amount of data, a shorter RPO and a shorter RTO will each increase our cost substantially. A simple backup and restore has little cost, but the RPO is the time between backups, and that data can potentially be lost. At the other end of the spectrum we are looking at a clustered environment with duplicate hardware and asynchronous copies of the data at the array level. This duplicated hardware environment doubles our initial capital cost, and may be worth the expense depending on the business requirements. My personal thoughts are that we will need to be back online with as much as 20% of the data center immediately if possible, and another 40% available within hours with no data loss. Because we have many options, it will always come down to our business requirements and the cost of our chosen recovery solution.
Nice thought, Mark!
For VNX/CLARiiON, we do have redundant hardware-level components to achieve HA. Are there any software-level elements related to HA requirements?
Your five 9s: is this VNX 99.999%, or TOTAL availability from host to LUN? If it's the total, you also need to take into account the availability of the HBAs, cables and SAN. If your SAN only has three 9s, you can't get five 9s availability, even if the VNX has five! It'll be 99.9% x 99.999% = 99.899001%. In fact, to be scientifically correct, it's 99.9%, because of the precision with which the three 9s were given (one decimal digit).
So besides the five 9s of your LUN accessibility, you also need to take your whole SAN infrastructure into account.
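The multiplication above generalizes to any chain of components that must all be up at once. The sketch below assumes component failures are independent, which real environments only approximate; the function names are illustrative. It also shows the usual counterpart for redundant paths, where the group is down only if every member is down.

```python
import math


def series_availability(component_percents):
    """Availability (in percent) of components that must ALL be up.

    Assumes independent failures: the combined availability is the
    product of the individual availabilities.
    """
    return math.prod(p / 100 for p in component_percents) * 100


def parallel_availability(component_percents):
    """Availability (in percent) of redundant components where ANY one suffices.

    Assumes independent failures: the group is unavailable only when
    every member is unavailable at the same time.
    """
    all_down = math.prod(1 - p / 100 for p in component_percents)
    return (1 - all_down) * 100


if __name__ == "__main__":
    # Three-9s SAN in front of a five-9s array, as in the post above.
    print(series_availability([99.9, 99.999]))
    # Two independent three-9s fabrics used as redundant paths.
    print(parallel_availability([99.9, 99.9]))
```

The first call reproduces the 99.899001% figure. The second illustrates why duplicate HBAs, switches and paths matter: under the independence assumption, two three-9s fabrics together behave like a single path with about 99.9999% availability.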
At the beginning, this discussion was directed at the VNX product line and how we can best achieve five 9s at the array level. However, if you are able to see the prior posts, you can see where the discussion has moved on to the environment as a whole. Even if we try to reach five 9s in the environment, we find that losing a path can cause a performance degradation that drops performance below an acceptable level and renders data essentially unavailable according to our business model or an SLA. I agree with your post. I also think that in today's environments we must go a step further than just looking at an individual host environment.